Today, we’ll finally leave base R behind and make our lives much easier by introducing you to the tidyverse. ✨ We think for data cleaning, wrangling, and plotting, the tidyverse really is a no-brainer. A few good reasons for teaching the tidyverse are:
Read more on this here if you like.
But… this certainly shouldn’t put you off learning base R alternatives.
In general, the tidyverse is a collection of R packages that share an underlying design, syntax, and structure. One very prominent tidyverse package is dplyr for data manipulation.
In this tutorial, you will learn:

- what tidy data is
- tidy syntax
- dplyr verbs to solve your data manipulation challenges

Generally, we will encounter data in a tidy format. Tidy data refers to a standard way of mapping the structure of a data set. In a tidy data set:

- every column is a variable
- every row is an observation
- every cell is a single value
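To make this concrete, here is a small sketch with made-up data, contrasting an untidy layout (where a variable hides in the column headers) with its tidy equivalent:

```r
## Untidy: the variable "year" is spread across the column headers
untidy <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(12, 25),
  check.names = FALSE
)

## Tidy: every column is a variable, every row is an observation,
## every cell is a single value
tidy <- data.frame(
  country = c("A", "A", "B", "B"),
  year = c(2019, 2020, 2019, 2020),
  value = c(10, 12, 20, 25)
)
```

Later in this tutorial we will meet tidyr::pivot_longer(), which converts the first layout into the second.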
Exercise 1
Think about what you've learned about panel surveys (surveys fielded to the same set of participants at multiple points in time). Why do they pose a problem for the general assumptions of tidy data, and what possible data formats can you end up with?
Let’s load the tidyverse meta-package and check the output.
```r
library(tidyverse)
```

We see that we have actually loaded a number of packages (which could also be loaded individually): ggplot2, tibble, dplyr, etc. We can also see information about the package versions and some namespace conflicts.
The tidyverse actually comes with a lot more packages than those that are loaded automatically.
```r
tidyverse_packages()
## [1] "broom"         "cli"           "crayon"        "dbplyr"
## [5] "dplyr"         "dtplyr"        "forcats"       "googledrive"
## [9] "googlesheets4" "ggplot2"       "haven"         "hms"
## [13] "httr"          "jsonlite"      "lubridate"     "magrittr"
## [17] "modelr"        "pillar"        "purrr"         "readr"
## [21] "readxl"        "reprex"        "rlang"         "rstudioapi"
## [25] "rvest"         "stringr"       "tibble"        "tidyr"
## [29] "xml2"          "tidyverse"
```
We’ll use several of these additional packages during the remainder of this course, e.g. the lubridate package for working with dates and the rvest package for web scraping. However, these packages have to be loaded separately!
The %>% operator

The tidyverse loads its own pipe operator, denoted %>%. Using pipes can dramatically improve the experience of reading and writing code. Compare:
```r
## These next two lines of code do exactly the same thing.
mpg %>% filter(manufacturer == "audi") %>% group_by(model) %>% summarise(hwy_mean = mean(hwy))
summarise(group_by(filter(mpg, manufacturer == "audi"), model), hwy_mean = mean(hwy))
```

The first line reads from left to right, exactly how I thought of the operations in my head: take this object (mpg), do this (filter), then do this (group_by), etc.

The second line totally inverts this logical order (the final operation comes first!). Who wants to read things inside out?
The piped version of the code is even more readable if we write it over several lines. Here it is again and, this time, I’ll run it for good measure so you can see the output:
```r
mpg %>%
  filter(manufacturer == "audi") %>%
  group_by(model) %>%
  summarise(hwy_mean = mean(hwy))
```

Remember: using vertical space costs nothing and makes for much more readable/writeable code than cramming things horizontally.
PS — The pipe is originally from the magrittr package, which can do some other cool things if you’re inclined to explore.
dplyr

In this tutorial, you'll learn and practice some core dplyr functions for working with data:

- select(): keep or exclude some columns
- filter(): keep rows that satisfy your conditions
- mutate(): add columns from existing data or edit existing columns
- group_by(): lets you define groups within your data set
- summarize(): get summary statistics
- arrange(): reorder the rows according to single or multiple variables

To demonstrate and practice how these verbs (functions) work, we'll use the penguins dataset.
The three species of penguins in this data set are Adelie, Chinstrap, and Gentoo. The data set contains eight variables.
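If penguins is not already available in your session, it ships with the palmerpenguins package (assuming that package is installed; run install.packages("palmerpenguins") once if it isn't):

```r
# load the penguins data set from the palmerpenguins package
library(palmerpenguins)
penguins
```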
select()

The first verb (function) we will use is select(). We can employ it to manipulate our data based on columns. If you recall from our initial exploration of the data set, there were eight variables attached to every observation. Do you recall them? If not, no problem: you can use names() to retrieve the names of the variables in a data frame.
```r
names(penguins)
## [1] "species"           "island"            "bill_length_mm"
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"
## [7] "sex"               "year"
```
Say we are only interested in the species, island, and year variables; then we can use the following syntax:

```r
select(data, columns)
```
Exercise 2 The following code chunk selects the variables we need. Can you adapt it so that we also keep the body_mass_g and sex variables?
```r
dplyr::select(penguins, species, island, year)
```

Good to know: to drop variables, use - before the variable name, i.e. select(penguins, -year) to drop the year column (select everything but the year column).
filter()

The second verb (function) we will employ is filter(). filter() lets you use a logical test to extract specific rows from a data frame. To use filter(), pass it the data frame followed by one or more logical tests. filter() will return every row that passes each logical test.
The most commonly used logical operators are:

- ==: equal to
- !=: not equal to
- >, >=: greater than, greater than or equal to
- <, <=: less than, less than or equal to
- &, |: and, or

Say we are interested in retrieving the observations from the year 2007. We would do:
```r
dplyr::filter(penguins, year == 2007)
```

Exercise 3 Can you adapt the code to retrieve all the observations of Chinstrap penguins from 2007?
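You can also combine several logical tests in one filter() call; conditions separated by commas are combined with &. A quick sketch with arbitrary conditions:

```r
## Keep penguins heavier than 4 kg observed on either Biscoe or Dream island.
## Comma-separated conditions act like &.
dplyr::filter(
  penguins,
  body_mass_g > 4000,
  island == "Biscoe" | island == "Dream"
)
```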
Exercise 4 We can leverage the pipe operator to sequence our code in a logical manner. Can you adapt the following code chunk using the pipe and the logical operators we discussed?
```r
only_2009 <- dplyr::filter(penguins, year == 2009)
only_2009_chinstraps <- dplyr::filter(only_2009, species == "Chinstrap")
only_2009_chinstraps_species_sex_year <- dplyr::select(only_2009_chinstraps, species, sex, year)
final_df <- only_2009_chinstraps_species_sex_year
final_df # to print it in our console
```
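One possible piped version looks like this (a sketch; it yields the same final_df without the intermediate objects):

```r
## Filter first, then select, all in one pipeline
final_df <- penguins %>%
  dplyr::filter(year == 2009, species == "Chinstrap") %>%
  dplyr::select(species, sex, year)

final_df
```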
mutate()

mutate() lets us create, modify, and delete columns. The most common use for now will be to create new variables based on existing ones. Say we are working with a U.S. client who feels more comfortable assessing the weight of the penguins in pounds. We would use mutate() like this:

```r
mutate(new_var_name = expression)
```
```r
penguins %>%
  dplyr::mutate(body_mass_lbs = body_mass_g / 453.6)
```

group_by() and summarize()

These two verbs, group_by() and summarize(), tend to go together. When combined, summarize() will create a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input. For example:
```r
# compare this output with the one below
penguins %>%
  dplyr::summarize(heaviest_penguin = max(body_mass_g, na.rm = TRUE))

penguins %>%
  dplyr::group_by(species) %>%
  dplyr::summarize(heaviest_penguin = max(body_mass_g, na.rm = TRUE))
```

Exercise 5 Can you get the weight of the lightest penguin of each species? You can use min(). What happens when, in addition to species, you also group by year, i.e. group_by(species, year)?
```r
penguins %>%
  dplyr::group_by(species, year) %>%
  dplyr::summarize(lightest_penguin = min(body_mass_g, na.rm = TRUE))
```

arrange()

The arrange() verb is pretty self-explanatory. arrange() orders the rows of a data frame by the values of selected columns in ascending order. You can wrap a column in desc() to arrange in descending order. The following chunk arranges the data frame based on the length of the penguins' bills. Your hint tab contains the code for the descending-order alternative.
```r
penguins %>%
  dplyr::arrange(bill_length_mm)

penguins %>%
  dplyr::arrange(desc(bill_length_mm))
```

Exercise 6 Can you create a data frame arranged by body_mass_g for the penguins observed on "Dream" island?
Quiz
How long was the longest observed bill of a Gentoo penguin in 2008?
How long was the shortest observed bill of a Gentoo penguin in 2008?
- ungroup(): for ungrouping data after using the group_by() command. Particularly useful with the summarise and mutate commands, as we've already seen.
- slice(): subset rows by position rather than filtering by values, e.g. penguins %>% slice(c(1, 5)).
- pull(): extract a column from a data frame as a vector or scalar, e.g. penguins %>% filter(sex == "female") %>% pull(body_mass_g).
- count() and distinct(): count and isolate unique observations, e.g. penguins %>% count(species), or penguins %>% distinct(species). You could also use a combination of mutate, group_by, and n(), e.g. penguins %>% group_by(species) %>% mutate(num = n()).
There is also a whole class of window functions for creating lags, rankings, cumulative aggregates, etc. See vignette("window-functions").
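A quick sketch of two window functions in action, ranking and lagging the penguins' body mass within each species:

```r
## Window functions operate within groups without collapsing rows
penguins %>%
  dplyr::group_by(species) %>%
  dplyr::mutate(
    mass_rank = dplyr::min_rank(body_mass_g),  # rank within species
    prev_mass = dplyr::lag(body_mass_g)        # previous row's value
  ) %>%
  dplyr::ungroup()
```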
The final set of dplyr “goodies” are the family of join operations. These are important enough that we want to go over some concepts in a bit more depth.
However - note that we will cover relational data structures (and SQL) in even more depth in a separate lab session!
dplyr

One of the mainstays of the dplyr package is merging data with its family of join operations:

- inner_join(df1, df2)
- left_join(df1, df2)
- right_join(df1, df2)
- full_join(df1, df2)
- semi_join(df1, df2)
- anti_join(df1, df2)

(You might find it helpful to see visual depictions of the different join operations here.)
For the simple examples that I'm going to show here, we'll need some data sets that come bundled with the nycflights13 package. Load it now and then inspect these data frames in your own console.

Let's perform a left join on the flights and planes datasets. Note: I'm going to subset columns after the join, but only to keep the output compact.
```r
left_join(flights, planes) %>%
  select(year, month, day, dep_time, arr_time, carrier, flight, tailnum, type, model) %>%
  head(3) ## Just to save vertical space in output
```

Note that dplyr made a reasonable guess about which columns to join on (i.e. columns that share the same name). It also told us its choices:

```r
## Joining, by = c("year", "tailnum")
```
However, there's an obvious problem here: the variable "year" does not have a consistent meaning across our joining datasets! In one it refers to the year of the flight; in the other, to the year of construction.

Luckily, there's an easy way to avoid this problem. Try ?dplyr::join.

You just need to be more explicit in your join call by using the by = argument. You can also rename any ambiguous columns to avoid confusion.
```r
left_join(
  flights,
  planes %>% rename(year_built = year), ## Not necessary w/ below line, but helpful
  by = "tailnum" ## Be specific about the joining column
) %>%
  select(year, month, day, dep_time, arr_time, carrier, flight, tailnum, year_built, type, model) %>%
  head(3)
```

Last thing I'll mention for now: note what happens if we again specify the join column… but don't rename the ambiguous "year" column in at least one of the given data frames.
```r
left_join(
  flights,
  planes, ## Not renaming "year" to "year_built" this time
  by = "tailnum"
) %>%
  select(contains("year"), month, day, dep_time, arr_time, carrier, flight, tailnum, type, model) %>%
  head(3)
```

Make sure you know what "year.x" and "year.y" are. Again, it pays to be specific.
Let’s move on to another important tidyverse package!
tidyr package

Key tidyr verbs are:
pivot_longer: Pivot wide data into long format (i.e. “melt”).
pivot_wider: Pivot long data into wide format (i.e. “cast”).
separate: Separate (i.e. split) one column into multiple columns.
unite: Unite (i.e. combine) multiple columns into one.
pivot_longer()

```r
stocks = data.frame( ## Could use "tibble" instead of "data.frame" if you prefer
  time = as.Date('2009-01-01') + 0:1,
  X = rnorm(2, 0, 1),
  Y = rnorm(2, 0, 2),
  Z = rnorm(2, 0, 4)
)
stocks
stocks %>% pivot_longer(-time, names_to = "stock", values_to = "price")
```

Let's quickly save the "tidy" (i.e. long) stocks data frame.
```r
## Write out the argument names this time: i.e. "names_to=" and "values_to="
tidy_stocks <- stocks %>%
  pivot_longer(-time, names_to = "stock", values_to = "price")
```

pivot_wider()

```r
tidy_stocks %>% pivot_wider(names_from = stock, values_from = price)
tidy_stocks %>% pivot_wider(names_from = time, values_from = price)
```

Note that the second example, which pivots on a different column, has effectively transposed the data.
tidyr::separate()

```r
economists = data.frame(name = c("Adam.Smith", "Paul.Samuelson", "Milton.Friedman"))
economists
economists %>% separate(name, c("first_name", "last_name"))
```

This command is pretty smart. But to avoid ambiguity, you can also specify the separation character explicitly. Note that sep is interpreted as a regular expression, so a literal dot must be escaped: separate(..., sep = "\\.").
A related function is separate_rows, for splitting up cells that contain multiple fields or observations (a frustratingly common occurrence with survey data).
```r
jobs = data.frame(
  name = c("Jack", "Jill"),
  occupation = c("Homemaker", "Philosopher, Philanthropist, Troublemaker")
)
jobs

## Now split out Jill's various occupations into different rows
jobs %>% separate_rows(occupation)
```

tidyr::unite()

```r
gdp = data.frame(
  yr = rep(2016, times = 4),
  mnth = rep(1, times = 4),
  dy = 1:4,
  gdp = rnorm(4, mean = 100, sd = 2)
)
gdp

## Combine "yr", "mnth", and "dy" into one "date" column
gdp %>% unite(date, c("yr", "mnth", "dy"), sep = "-")
```

Note that unite will automatically create a character variable. You can see this better if we convert it to a tibble.
```r
gdp_u = gdp %>% unite(date, c("yr", "mnth", "dy"), sep = "-") %>% as_tibble()
gdp_u
```

If you want to convert it to something else (e.g. date or numeric), you will need to modify it using mutate. See below for an example using the lubridate package's super helpful date conversion functions.

```r
library(lubridate)
gdp_u %>% mutate(date = ymd(date))
```

Let us remind you again: the key to learning R is Google! We can only give you an overview of basic R functions; to really learn R you will have to actively use it yourself, troubleshoot, ask questions, and google! It is very likely that someone else has had the exact same (or just similar enough) issue before, and that the R community answered it with 5+ different solutions years ago. 😉
This tutorial is partly based on R for Data Science, section 5.2; Quantitative Politics with R, chapter 3; and the Tidyverse session of the course Data Science for Economists by Grant McDermott.